Persistence

In this notebook, our focus is on persistent data — the kind that outlives the program that creates it.

That’s not true by default for objects a script constructs, of course; things like lists, dictionaries, and even class instance objects live in your computer’s memory and are lost as soon as the script ends.

To make data live longer, we need to do something special.

In Python there are (at least) five traditional ways to save information in between program executions:

  • Flat files

    • Text and bytes stored directly on your computer
  • DBM keyed files

    • Keyed access to strings stored in dictionary-like files
  • Pickled objects

    • Serialized Python objects saved to files and strings (see the sketch after this list)
  • Shelve files

    • Pickled Python objects saved in DBM keyed files
  • SQL relational databases (RDBMSs)

    • Table-based storage that supports SQL queries (SQLite, MySQL, PostgreSQL, etc.)
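
As a taste of the first four approaches, pickled and shelved objects take just a few lines of code; here is a minimal sketch using only the standard library (the file names are illustrative):

import pickle, shelve

record = {'name': 'Bob', 'job': 'dev', 'pay': 50000}

with open('record.pkl', 'wb') as f:      # a flat file of pickled bytes
    pickle.dump(record, f)

with shelve.open('recorddb') as db:      # pickled objects in a DBM keyed file
    db['bob'] = record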

SQL Database Interface

For programs that can benefit from the power of SQL, Python also broadly supports relational database management systems (RDBMSs).

The databases we’ll meet in this notebook, though, are structured and processed in very different ways:

  • They store data in related tables of columns (rather than in persistent dictionaries of arbitrarily structured Python objects).

  • They support the SQL query language for accessing data and exploiting relationships among it (instead of Python object traversals).

For some applications, the end result can be a potent combination. Moreover, some SQL-based database systems provide industrial-strength persistence support for enterprise-level data.

Python Modules

Today, there are freely available interfaces that let Python scripts utilize all common relational database systems, both free and commercial: MySQL, Oracle, Sybase, Informix, InterBase, PostgreSQL (Postgres), SQLite, ODBC, and more.

In addition, the Python community has defined a database API (aka DB API) specification that works portably with a variety of underlying database packages.

Scripts written for this API can be migrated to different database vendor packages, with minimal or no source code changes.

As of Python 2.5, Python itself includes built-in support for the SQLite relational database system as part of its standard library.

Because this system supports the portable database API, it serves as a tool for both program storage and prototyping—systems developed with SQLite work largely unchanged when a more feature-rich database such as MySQL or Oracle is deployed.

Moreover, the popular SQLObject and SQLAlchemy third-party systems both provide an Object Relational Mapper (ORM), which grafts an object interface onto your database, in which tables are modeled by Python classes, rows by instances of those classes, and columns by instance attributes.

Python SQL Interface

The Python Database API (DB API) specification defines an interface for communicating with underlying database systems from Python scripts.

Vendor-specific database interfaces for Python may or may not conform to this API completely, but all database extensions for Python in common use are minor variations on a theme.

Under the database API, SQL databases in Python are grounded on three core concepts, sketched in code after this list:

  • Connection Objects:

    Represent a connection to a database, are the interface to rollback and commit operations, provide package implementation details, and generate cursor objects.

  • Cursor Objects:

    Represent an SQL statement submitted as a string and can be used to access and step through SQL statement results.

  • Query results of SQL select statements:

    Are returned to scripts as Python sequences of sequences (e.g., a list of tuples), representing database tables of rows. Within these row sequences, column field values are normal Python objects such as strings, integers, and floats (e.g., [('bob', 48), ('emily', 47)]). Column values may also be special types that encapsulate things such as date and time, and database NULL values are returned as the Python None object.
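
To make these concepts concrete, here is a minimal sketch that exercises all three using the standard library’s sqlite3 module (the in-memory database and pets table are purely illustrative; the tutorial below covers each step in detail):

import sqlite3

connobj = sqlite3.connect(':memory:')       # connection object
cursobj = connobj.cursor()                  # cursor object for submitting SQL
cursobj.execute('create table pets (name char(10), kind char(10))')
cursobj.execute("insert into pets values ('Fluffy', 'cat')")
cursobj.execute('select * from pets')
print(cursobj.fetchall())                   # query results: [('Fluffy', 'cat')]
connobj.close()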

Connection Objects

Beyond this, the API defines a standard set of database exception types, special database type object constructors, and informational top-level calls including thread safety and replacement style checks.

For instance, to establish a database connection under the Python API-compliant Oracle interface, install the commonly used Python Oracle extension module (e.g., pip install cx_Oracle) as well as Oracle itself, and then run a statement of this form:

connobj = connect("user/password@system")

This call’s arguments may vary per database and vendor (e.g., some may require network details or a local file’s name), but they generally contain what you provide to log in to your database system.

Once you have a connection object, there are a variety of things you can do with it, including:

connobj.close()     # close connection now (not at object __del__ time)
connobj.commit()    # commit any pending transactions to the database
connobj.rollback()  # roll database back to start of pending transactions

Cursor Objects

But one of the most useful things to do with a connection object is to generate a cursor object:

cursobj = connobj.cursor()  # return a new cursor object for running SQL

Cursor objects have a set of methods, too (e.g., close to close the cursor before its destructor runs, and callproc to call a stored procedure), but the most important may be this one:

cursobj.execute(sqlstring [, parameters])  # run SQL query or command string

Parameters are passed in as a sequence or mapping of values, and are substituted into the SQL statement string according to the interface module’s replacement target conventions.

The execute method can be used to run a variety of SQL statement strings, as the sketch after this list shows:

  • DDL definition statements (e.g., CREATE TABLE);
  • DML modification statements (e.g., UPDATE or INSERT);
  • DQL query statements (e.g., SELECT)
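
All three categories go through the same call; the following is a sketch, assuming an open cursor object cursobj and an illustrative staff table:

cursobj.execute('create table staff (name char(30), pay int)')      # DDL: CREATE
cursobj.execute("insert into staff values ('Bob', 50000)")          # DML: INSERT
cursobj.execute('select name from staff where pay > ?', (40000,))   # DQL: SELECT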

After running an SQL statement, the cursor’s rowcount attribute gives the number of rows changed (for DML changes) or fetched (for DQL queries), and the cursor’s description attribute gives column names and types after a query; execute also returns the number of rows affected or fetched in most vendor interfaces.

For DQL query statements, you must call one of the fetch methods to complete the operation:

single_tuple  = cursobj.fetchone()          # fetch next row of a query result
list_of_tuple = cursobj.fetchmany([size])   # fetch next set of rows of query result
list_of_tuple = cursobj.fetchall()          # fetch all remaining rows of the result

And once you’ve received fetch method results, table information is processed using normal Python sequence operations; for example, you can step through the tuples in a fetchall result list with a simple for loop or comprehension expression.

Most Python database interfaces also allow you to provide values to be passed to SQL statement strings, by providing targets and a tuple of parameters. For instance:

query = 'SELECT name, shoesize FROM spam WHERE job = ? AND age = ?' 
cursobj.execute(query, (value1, value2))
results = cursobj.fetchall()
for row in results: 
    pass  # do something

In this event, the database interface utilizes prepared statements (an optimization and convenience) and correctly passes the parameters to the database regardless of their Python types.

The notation used to code targets in the query string may vary in some database interfaces (e.g., :p1 and :p2, or two %s, rather than the two ?s used by the SQLite interface); in any event, this is not the same as Python’s % string formatting operator, and it sidesteps security issues, such as SQL injection, that raw string formatting invites.
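
For illustration only, the same query might be coded as follows under three common replacement-target styles; a module advertises its own convention in its top-level paramstyle attribute:

q1 = 'SELECT name, shoesize FROM spam WHERE job = ? AND age = ?'       # qmark style
q2 = 'SELECT name, shoesize FROM spam WHERE job = :p1 AND age = :p2'   # named style
q3 = 'SELECT name, shoesize FROM spam WHERE job = %s AND age = %s'     # format style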

Finally, if your database supports stored procedures, you can call them with the callproc method or by passing an SQL CALL or EXEC statement string to the execute method.

callproc may generate a result table retrieved with a fetch variant, and returns a modified copy of the input sequence — input parameters are left untouched, and output and input/output parameters are replaced with possibly new values.
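
As a hedged sketch only: SQLite itself has no stored procedures, so a call like the following applies to systems such as MySQL or Oracle, and raise_pay is a hypothetical procedure with an IN name parameter and an INOUT pay parameter:

params = cursobj.callproc('raise_pay', ('bob', 50000))  # hypothetical procedure
print(params)                  # IN value unchanged; INOUT slot may hold a new value
results = cursobj.fetchall()   # some interfaces also expose a result table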

Additional API features, including support for database blobs (roughly, with sized results), are described in the API’s documentation.

For now, let’s move on to do some real SQL processing in Python.

An SQL Database API Tutorial with SQLite

We don’t have space to provide an exhaustive reference for the database API in this notebook.

To sample the flavor of the interface, though, let’s step through a few simple examples.

We’ll use the SQLite database system for this tutorial.

SQLite is a standard part of Python itself, which you can reasonably expect to be available in all Python installations. Although SQLite implements a complete relational database system, it takes the form of an in-process library instead of a server.

This generally makes it better suited for program storage than for enterprise-level data needs.

Note:

Thanks to Python’s portable DB API, though, other popular database packages such as PostgreSQL, MySQL, and Oracle are used almost identically; the initial call to log in to the database will be all that normally requires different argument values for scripts that use standard SQL code.

Because of this, we can use the SQLite system both as a prototyping tool in applications development and as an easy way to get started with the Python SQL database API in this notebook.

Getting Started

Regardless of which database system your scripts talk to, the basic SQL interface in Python is very simple.

In fact, it’s hardly object-oriented at all: queries and other database commands are sent as strings of SQL.

Whether large or small, though, the Python code needed to process your database turns out to be surprisingly straightforward.

To get started, the first thing we need to do is open a connection to the database and create a table for storing records:


In [1]:
import sqlite3
conn = sqlite3.connect('data/dbase1')

We start out by importing the Python SQLite interface here; it’s a standard library module known as sqlite3 to our scripts.

Next we create a connection object, passing in the items our database requires at start-up time—here, the name of the local file where our databases will be stored.

This file is what you’ll want to back up to save your database. The connect call will create the file if needed, or open its current content; SQLite also accepts the special string :memory: to create a temporary database in memory instead.
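
For instance, a throwaway database that never touches the disk can be opened like this (nothing beyond the standard library is assumed):

In [ ]:
import sqlite3
tempconn = sqlite3.connect(':memory:')  # temporary in-memory database, lost at exit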

As long as a script sticks to using standard SQL code, the connect call’s arguments are usually the only thing that can vary across different database systems.

For example, in the MySQL interface this call accepts a network host’s domain name, user name, and password, passed as keyword arguments instead, and the Oracle example sketched earlier expects a more specific string syntax.
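
For reference, a MySQL-style login might look like the following sketch, using the third-party MySQLdb interface; the host, user, password, and database values are illustrative:

import MySQLdb
conn = MySQLdb.connect(host='localhost', user='root', passwd='python', db='peopledb')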

Once we’ve gotten past this platform-specific call, though, the rest of the API is largely database neutral.

Making Databases and Tables

Next, let’s make a cursor for submitting SQL statements to the database server, and submit one to create a first table:


In [2]:
curs = conn.cursor()
try:
    curs.execute('drop table people')
except sqlite3.OperationalError:
    pass  # table did not exist yet
curs.execute('create table people (name char(30), job char(10), pay int(4))')


Out[2]:
<sqlite3.Cursor at 0x1045d3c00>

The last command here creates the table called “people” within the database; the name, job, and pay information specifies the columns in this table, as well as their datatypes, using a “type(size)” syntax — two strings and an integer.

Datatypes can be more sophisticated than ours, but we’ll ignore such details here.

In SQLite, the file is the database, so there’s no notion of creating or using a specific database within it, as there is in some systems.

Adding Records

There are three basic statement-based approaches we can use here:

  • inserting one row at a time with execute;
  • inserting multiple rows with a single executemany call;
  • inserting one row at a time inside a Python loop.

Here is the simple case:


In [3]:
curs.execute('insert into people values (?, ?, ?)', ('Bob', 'dev', 50000))
curs.rowcount


Out[3]:
1

In [4]:
sqlite3.paramstyle


Out[4]:
'qmark'

Here, qmark means this module accepts ? for replacement targets.

Other database modules might use styles such as format (meaning a %s target), or numeric indexes or mapping keys; see the DB API for more details.

To insert multiple rows with a single statement, use the executemany method and a sequence of row sequences (e.g., a list of lists). This call is like calling execute once for each row sequence in the argument, and in fact may be implemented as such; database interfaces may also use database-specific techniques to make this run quicker, though:


In [5]:
curs.executemany('insert into people values (?, ?, ?)', 
                     [ ('Sue', 'mus', '70000'),
                       ('Ann', 'mus', '60000')])
curs.rowcount


Out[5]:
2

We inserted two rows at once in the last statement.

It’s hardly any more work to achieve the same result by inserting one row at a time with a Python loop:


In [6]:
rows = [['Tom', 'mgr', 100000], ['Kim', 'adm', 30000], ['pat', 'dev', 90000]]
for row in rows:
    curs.execute('insert into people values (?, ?, ?)', row)
conn.commit()

Blending Python and SQL like this starts to open up all sorts of interesting possibilities.

Notice the last command; we always need to call the connection’s commit method to write our changes out to the database. Otherwise, when the connection is closed, our changes may be lost.

In fact, until we call the commit method, none of our inserts may be visible from other database connections.

Side note:

Technically, the API suggests that a connection object should automatically call its rollback method to back out changes that have not yet been committed, when it is closed (which happens manually when its close method is called, or automatically when the connection object is about to be garbage collected).

For database systems that don’t support transaction commit and rollback operations, these calls may do nothing. SQLite implements both the commit and rollback methods; the latter rolls back any changes made since the last commit.
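
As a sqlite3-specific convenience, the connection object can also be used as a context manager that commits if the block succeeds and rolls back if it raises an exception. A small sketch follows; the forced exception is just for illustration, so the table is left unchanged:

In [ ]:
try:
    with conn:  # commit on success, rollback on exception
        conn.execute('insert into people values (?, ?, ?)', ('Dee', 'dev', 80000))
        raise RuntimeError('force a rollback')  # abort: the insert above is undone
except RuntimeError:
    pass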

Running Queries


In [7]:
curs.execute('select * from people')
for row in curs.fetchall():
    print(row)


('Bob', 'dev', 50000)
('Sue', 'mus', 70000)
('Ann', 'mus', 60000)
('Tom', 'mgr', 100000)
('Kim', 'adm', 30000)
('pat', 'dev', 90000)

Tuple unpacking comes in handy in loops here, too, to pick out column values as we go.

Here’s a simple formatted display of two of the columns’ values:


In [8]:
curs.execute('select * from people')
for (name, job, pay) in curs.fetchall():
    print(name, ':', pay)


Bob : 50000
Sue : 70000
Ann : 60000
Tom : 100000
Kim : 30000
pat : 90000

Because the query result is a sequence, we can use Python’s powerful sequence and iteration tools to process it.

For instance, to select just the name column values, we can run a more specific SQL query and get back a list of tuples:


In [9]:
curs.execute('select name from people')
names = curs.fetchall()
names


Out[9]:
[('Bob',), ('Sue',), ('Ann',), ('Tom',), ('Kim',), ('pat',)]
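
Each row comes back as a one-item tuple here; if plain strings are more convenient, a list comprehension can flatten the result:

In [ ]:
[rec[0] for rec in names]  # ['Bob', 'Sue', 'Ann', 'Tom', 'Kim', 'pat']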

fetchall vs fetchone

The fetchall call we’ve used so far fetches the entire query result table all at once, as a single sequence (an empty sequence comes back if the result is empty).

That’s convenient, but it may be slow enough to block the caller temporarily for large result tables, or generate substantial network traffic if the server is running remotely (in a GUI, for instance, such a fetch could easily require a parallel thread).

To avoid such a bottleneck, we can also grab just one row, or a bunch of rows, at a time with fetchone and fetchmany.

The fetchone call returns the next result row, or the false value None at the end of the table:


In [10]:
curs.execute('select * from people')
while True:
    row = curs.fetchone() 
    if not row: 
        break 
    print(row)


('Bob', 'dev', 50000)
('Sue', 'mus', 70000)
('Ann', 'mus', 60000)
('Tom', 'mgr', 100000)
('Kim', 'adm', 30000)
('pat', 'dev', 90000)
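
The cursor object is also an iterator that yields one result row at a time, so the manual fetchone loop can be recast as a simple for loop: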

In [11]:
curs.execute('select * from people')
for row in curs:
    print(row)


('Bob', 'dev', 50000)
('Sue', 'mus', 70000)
('Ann', 'mus', 60000)
('Tom', 'mgr', 100000)
('Kim', 'adm', 30000)
('pat', 'dev', 90000)

The fetchmany call returns a sequence of rows from the result, but not the entire table; you can specify how many rows to grab each time with a parameter or rely on the default as given by the cursor’s arraysize attribute.

Each call gets at most that many more rows from the result or an empty sequence at the end of the table:


In [13]:
curs.execute('select * from people')
while True:
    rows = curs.fetchmany()  # size=N optional argument
    if not rows: 
        break
    for row in rows:
        print(row)


('Bob', 'dev', 50000)
('Sue', 'mus', 70000)
('Ann', 'mus', 60000)
('Tom', 'mgr', 100000)
('Kim', 'adm', 30000)
('pat', 'dev', 90000)

A More Sophisticated Example
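
To go further, the cursor’s description attribute records each result column’s metadata after a query; in SQLite only the first item of each tuple, the column’s name, is filled in: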


In [14]:
curs.description


Out[14]:
(('name', None, None, None, None, None, None),
 ('job', None, None, None, None, None, None),
 ('pay', None, None, None, None, None, None))

In [15]:
curs.execute('select * from people')
colnames = [desc[0] for desc in curs.description]
for row in curs:
    print('-' * 30)
    for (name, value) in zip(colnames, row):
        print('%s => %s' % (name, value))


------------------------------
name => Bob
job => dev
pay => 50000
------------------------------
name => Sue
job => mus
pay => 70000
------------------------------
name => Ann
job => mus
pay => 60000
------------------------------
name => Tom
job => mgr
pay => 100000
------------------------------
name => Kim
job => adm
pay => 30000
------------------------------
name => pat
job => dev
pay => 90000

Exercise:

Convert list of row tuples to list of row dicts with field name keys.


In [ ]:
# %load files/makedicts.py
"""
convert list of row tuples to list of row dicts with field name keys
"""
import sqlite3

def makedicts(cursor, query, params=()):
    cursor.execute(query, params)
    colnames = [desc[0] for desc in cursor.description]
    rowdicts = [dict(zip(colnames, row)) for row in cursor]
    return rowdicts

conn = sqlite3.connect('data/dbase1')
cursor = conn.cursor()
query  = 'select name, pay from people where pay < ?'
lowpay = makedicts(cursor, query, [70000])
for rec in lowpay: 
    print(rec)

Exercise 2:

Load data from text files: load the table from a comma-delimited text file (a sketch of one solution follows the file listings below).


In [18]:
!cat data/data.txt





kim,devel,60000

In [19]:
!cat data/data2.txt



ann,manager,80000
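
One way to approach this exercise, as a sketch: read each comma-delimited line, split it into fields, and insert all the rows with executemany (this assumes the name,job,pay layout shown in the listings above, and skips the blank lines they contain):

In [ ]:
import sqlite3

def loadtable(cursor, datafile):
    """insert name,job,pay rows from a comma-delimited text file"""
    rows = []
    with open(datafile) as f:
        for line in f:
            line = line.strip()
            if line:                                # skip any blank lines
                name, job, pay = line.split(',')
                rows.append((name, job, int(pay)))
    cursor.executemany('insert into people values (?, ?, ?)', rows)

conn = sqlite3.connect('data/dbase1')
curs = conn.cursor()
loadtable(curs, 'data/data.txt')
conn.commit()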